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Abstract —Mining useful clusters from high dimensional data 
has received significant attention of the computer vision and 
pattern recognition community in the recent years. Linear and 
non-linear dimensionality reduction has played an important role 
to overcome the curse of dimensionality. Hovt^ever, often such 
methods are accompanied with three different problems: high 
computational complexity (usually associated with the nuclear 
norm minimization), non-convexity (for matrix factorization 
methods) and susceptibility to gross corruptions in the data. In 
this paper we propose a principal component analysis (PCA) 
based solution that overcomes these three issues and approxi¬ 
mates a low-rank recovery method for high dimensional datasets. 
We target the low-rank recovery by enforcing two types of 
graph smoothness assumptions, one on the data samples and 
the other on the features by designing a convex optimization 
problem. The resulting algorithm is fast, efficient and scalable 
for huge datasets with 0{n\og{n)) computational complexity in 
the number of data samples. It is also robust to gross corruptions 
in the dataset as well as to the model parameters. Clustering 
experiments on 7 benchmark datasets with different types of 
corruptions and background separation experiments on 3 video 
datasets show that our proposed model outperforms 10 state-of- 
the-art dimensionality reduction models. Our theoretical analysis 
proves that the proposed model is able to recover approximate 
low-rank representations with a bounded error for clusterable 
data. 

Index Terms —robust PCA, graph, structured low-rank repre¬ 
sentation, spectral graph theory, graph regularized PCA 

I. Introduction 

In the modern era of data explosion, many problems 
in signal and image processing, machine learning and pat¬ 


tern recognition require dealing with very high dimensional 
datasets, such as images, videos and web content. The data 
mining community often strives to reveal natural associations 
or hidden structures in the data. Over the past couple of 
decades matrix factorization has been adopted as one of the 
key methods in this context. Given a data matrix X G 
with n p-dimensional data vectors, the matrix factorization can 
be stated as determining V and W G such that 

X ^VW under different constraints on V and W. 

How can matrix factorization extract structures in the data? 
The answer to this question lies in the intrinsic association of 
linear dimensionality reduction with matrix factorization. Con¬ 
sider a set of gray-scale images of the same object captured 
under fixed lighting conditions with a moving camera, or a set 
of hand-written digits with different rotations. Given that the 
image has pixels, each such data sample is represented by 
a vector in . However, the intrinsic dimensionality of the 
space of all images of the same object captured with small 
perturbations is much lower than Thus, dimensionality 
reduction comes into play. Depending on the application and 
the type of data, one can either use a single linear subspace to 
approximate the data of different classes using the standard 
Principal Component Analysis (PCA) [1], a union of low 
dimensional subspaces where each class belongs to a different 
subspace (LRR and SSC) [10], [21], [39], [34], or a positive 
subspace to extract a positive low-rank representation of the 
data (NMF) [20]. The clustering or community detection can 
then be performed on the retrieved data representation in 


















the low dimensional space. Not surprisingly, all the above 
mentioned problems can be stated in the standard matrix 
factorization manner as shown in the models 1 to 3 of Fig. 1. 
Alternatively, the clustering quality for non-linearly separable 
datasets can be improved by using non-linear dimensionality 
reduction tools such as Laplacian Eigenmaps [4] or Kernel 
PCA [33]. 

In many cases low dimensional data follows some additional 
structure. Knowledge of such structure is beneficial, as we 
can use it to enhance the representativity of our models 
by adding structured priors [14], [23], [40]. A nowadays 
standard way to represent pairwise affinity between objects 
is by using graphs. The introduction of graph-based priors to 
enhance matrix factorization models has recently brought them 
back to the highest attention of the data mining community. 
Representation of a signal on a graph is well motivated by 
the emerging field of signal processing on graphs, based 
on notions of spectral graph theory [37]. The underlying 
assumption is that high-dimensional data samples lie on or 
close to a smooth low-dimensional manifold. Interestingly, the 
underlying manifold can be represented by its discrete proxy, 
i.e. a graph. Let G = (V,f) be a graph between the samples 
of X, where S is the set of edges and V is the set of vertices 
(data samples). Let A be the symmetric matrix that encodes 
the weighted adjacency information between the samples of 
X and D is the diagonal degree matrix with Da = Aij . 
Then the normalized graph Laplacian C that characterizes the 
graph G is defined as £ = — A)D~^^‘^. Exploiting 

the manifold information in the form of a graph can be seen 
as a method of incorporating local proximity information of 
the data samples into the dimensionality reduction framework, 
that can enhance the clustering quality in the low-dimensional 
space. 

A. Focus of this work 

In this paper, we focus on the application of PCA to 
clustering, projecting the data on a single linear subspace. We 
first describe PCA and its related models and then elaborate 
on how the data manifold information in the form of a graph 
can be used to enhance standard PCA. Finally, we present a 
novel, convex, fast and scalable method for PCA that recovers 
the low-rank representation via two graph structures. Our 
theoretical analysis proves that the proposed model is able to 
recover approximate low-rank representations with a bounded 
error for clusterable data, where the number of clusters is equal 
to the rank. Many real world datasets can be assumed to satisfy 
this assumption. For example, the USPS dataset which consists 
of ten digits. We call such data matrices as low-rank matrices 
on graphs. The clustering on these dataset can be done by 
recovering a clean low-rank representation. 

B. PCA and Related Work 

For a dataset X G with n p-dimensional data vectors, 

standard PCA learns the projections or principal components 
W G of X on a c-dimensional orthonormal basis V G 

I^pxc, ^ 

< p by solving model 3 in Fig. 1. Though 


non-convex, this problem has a global minimum that can be 
computed using Singular Value Decomposition (SVD), giving 
a unique low-rank representation U = VW. 

A main drawback of PCA is its sensitivity to heavy-tailed 
noise due to the Frobenius norm in the objective function. 
Thus, a few strong corruptions can result in erratic principal 
components. Robust PCA (RPCA) proposed by Candes et al. 
[7] overcomes this problem by recovering the clean low-rank 
representation U from grossly corrupted X by solving model 
4 in Fig. 1. Here S represents the sparse matrix containing the 
errors and ||G||* denotes the nuclear norm of U, the tightest 
convex relaxation of rank(f/). 

Recently, many works related to low-rank or sparse repre¬ 
sentation recovery have been proposed to incorporate the data 
manifold information in the form of a discrete graph into the 
dimensionality reduction framework [16], [44], [11], [6], [38], 
[18], [17], [28], [9]. In fact, for PCA, this can be considered 
as a method of exploiting the local smoothness information in 
order to improve clustering quality. The graph smoothness of 
the principal components W using the graph Laplacian £ has 
been exploited in various works that explicitly learn W and 
the basis V. We refer to such models as factorized models. 
In this context Graph Laplacian PCA (GLPCA) was proposed 
in [16] (model 5 in Fig. 1) and Manifold Regularized Matrix 
Factorization (MMF) in [44] (model 6 in Fig. 1). Note that the 
orthonormality constraint in this model is on V, instead of the 
principal components W. Later on, the authors of [34] have 
generalized robust PCA by incorporating the graph smoothness 
(model 7 in Fig. 1) term directly on the low-rank matrix instead 
of principal components. They call it Robust PCA on Graphs 
(RPCAG). 

Models 4 to 8 can be used for clustering in the low 
dimensional space. However, each of them comes with its own 
weaknesses. GLPCA [16] and MMF [44] improve upon the 
classical PCA by incorporating graph smoothness but they are 
non-convex and susceptible to data corruptions. Moreover, the 
rank c of the subspace has to be specified upfront. RPCAG 
[34] is convex and builds on the robustness property of RPCA 
[ 7 ] by incorporating the graph smoothness directly on the 
low-rank matrix and improves both the clustering and low- 
rank recovery properties of PCA. However, it uses the nuclear 
norm relaxation that involves an expensive SVD step in every 
iteration of the algorithm. Although fast methods for the SVD 
have been proposed, based on randomization [41], [22], [26], 
Frobenius norm based representations [43], [27] or structured 
RPCA [2], its use in each iteration makes it hard to scale to 
large datasets. 

C. Our Contributions 

In this paper we propose a fast, scalable, robust and con¬ 
vex clustering and low-rank recovery method for potentially 
corrupted low-rank signals. Our contributions are: 

1) We propose an approximate low-rank recovery method 
for corrupted data by utilizing only the graph smooth¬ 
ness assumptions both between the samples and between 
the features. 
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Fig. 1. A summary of the matrix factorization methods with and without graph regularization. X G is the matrix of n p-dimensional data vectors, 

V G and W G are the learned factors. U G is the low-rank matrix and S' G is the sparse matrix. || • ||i?, || • ||* and || • ||i denote 

the Frobenius, nuclear and ii matrix norms respectively. The data manifold At information can be leveraged in the form of a discrete graph G using the 
graph Laplacian C G resulting in various Graph Regularized PCA models. 


2) Our theoretical analysis proves that the proposed model 
is able to recover approximate low-rank representations 
with a bounded error for clusterable data, where the 
number of clusters is equal to the rank. We call such 
a data matrix a low-rank matrix on the graph. 

3) Our model is convex and although non-smooth it can be 
solved efficiently, that is in linear time in the number 
of samples, with a few iterations of the well-known 
FISTA algorithm. The construction of the two graphs 
costs 0{n\ogn) time, where n is the number of data 
samples. 

4) The resulting algorithm is highly parallelizable and 
scalable for large datasets since it requires only the 
multiplication of two sparse matrices with full vectors 
and elementwise soft-thresholding operations. 

5) Our extensive experimentation shows that the recovered 
close-to-low-rank matrix is a good approximation of 
the low-rank matrix obtained by solving the expensive 
state-of-the-art method [34] which uses the much more 
expensive nuclear norm. This is observed even in the 
presence of gross corruptions in the data. 

D. Connections and differences with the state-of-the-art 

The idea of using two graph regularization terms has 
previously appeared in the work of matrix completion [19], co¬ 
clustering [12], NMF [35], [5] and more recently in the context 
of low-rank representation [42]. However, to the best of our 
knowledge all these models aim to improve the clustering 
quality of the data in the low-dimensional space. The co¬ 
clustering & NMF based models which use such a scheme 
[12], [35] suffer from non-convexity and the works of [19] and 
[42] use a nuclear-norm formulation which is computationally 
expensive and not scalable for big datasets. Our proposed 
method is different from these models in the following sense: 

• We do not target an improvement in the low-rank rep¬ 
resentation via graphs. Our method aims to solely re¬ 
cover an approximate low-rank matrix with dual-graph 


regularization only. The underlying motivation is that 
one can obtain a good enough low-rank representation 
without using expensive nuclear norm or non-convex 
matrix factorization. Note that the NMF-based method 
[35] targets the smoothness of factors of the low-rank 
while the co-clustering [12] focuses on the smoothness 
of the labels. Our method, on the other hand, targets 
directly the recovery of the low-rank matrix, and not the 
one of the factors or labels. 

m We introduce the concept of low-rank matrices on graphs 
and provide a theoretical justification for the success of 
our model. The use of PCA as a scalable and efficient 
clustering method using dual graph regularization has 
surfaced for the very first time in this paper. 

A summary of the notations used in this paper is presented 
in Tab. I. We first introduce our proposed formulation and its 
optimization solution in Sections II & II-A and then develop 
a sound motivation of the model in Section IV. 

II. Fast Robust PCA on Graphs (FRPCAG) 

Let Cl G be the graph Laplacian of the graph Gi 

connecting the different samples of X (columns of X) and 
£2 ^ the Laplacian of graph G 2 that connects the 

features of X (rows of X). The construction of these two 
graphs is described in Section III. We denote by U G 
the low-rank noiseless matrix that needs to be recovered from 
the measures X, then our proposed model can be written as: 

imn ||A - f/||i + 7 i tr(C/£i U^) + 72 tr(f/T £2 U). (1) 

This problem can be reformulated in the equivalent split form 
min||A||i + 7 itr([/£i[/^)+ 72 tr([/^£ 2 C/), (2) 

s.t. X = U + S, 

where S models the sparse outliers in the data X. The || • ||i 
denotes the element-wise Li norm of a matrix. Model (2) 
has close connections with the RPC AG [34]. In fact the 


























TABLE I 

A SUMMARY OF NOTATIONS USED IN THIS WORK 


Notation 

Terminology 

II-IIf 

matrix frobenius norm 

Mil 

matrix ii norm 

n 

number of data samples 

P 

number of features / pixels 

c 

dimension of the subspace 

k 

number of classes in the data set 

X G 

data matrix 

u e RP^^ 

low-rank noiseless approximation of X 

U = USIU ' 

SVD of the low-rank matrix U 

V G RP^^ 

left singular vectors of U / principal directions of U 

s 

singular values of U 

W G R'^^^ 

right singular vectors of U / principal components of U 

A G or RP^P 

adjacency matrix between samples / features of X 

D = diag(J 2 j 

diagonal degree matrix 

(7 

smoothing parameter of the Gaussian kernel 

Gi 

graph between the samples of X 

G2 

graph between the features of X 

(V,f) 

set of vertices, edges for graph 

71 

penalty for Gi Tikhonov regularization term 

72 

penalty for G2 Tikhonov regularization term 

K 

nearest neighbors for the construction of graphs 

Cl G R^^^ 

Laplacian for graph Gi 

C2 G RP^P 

Laplacian for graph G2 

Cl = QAQ ' 

eigenvalue decomposition of £1 

C2 = pnp ' 

eigenvalue decomposition of C2 


nuclear norm term in RPCAG has been replaced by another 
graph Tikhonov term. The two graph regularization terms help 
in retrieving an approximate low-rank representation U by 
encoding graph smoothness assumptions on U without using 
the expensive nuclear norm of RPCAG, therefore we call it 
Fast Robust PGA on Graphs (FRPCAG). The main idea of 
our work is summarized in the fig. of the first page of this 
paper. 

A. Optimization Solution 

We use the Fast Iterative Soft Thresholding Algorithm 
(FISTA) [3] to solve problem (1). Let ^ M be a convex, 

differentiable function with a ;5-Lipschitz continuous gradient 
Vg and h : ^ M a convex function with a proximity 

operator prox^ : ^ defined as: 

Woyixhiv) = argmin i||a: -y\\l + \h{x). 

X ^ 

Our goal is to minimize the sum + h{x), which is 
done efficiently with proximal splitting methods. More infor¬ 
mation about proximal operators and splitting methods for 
non-smooth convex optimization can be found in [ 8 ]. For 
model (1), g{U) = 71 tr(f/A H -72 tr(f/^ £2 f/) and 
h{U) = ||X — f/||i. The gradient of g becomes 

V,(f/) = 2(7if/£i+72/:2t/). (3) 

We define an upper bound on the Lipschitz constant /3 as 
P < P' = 271 II Cl II 2 + 272 II £2 II 2 where || C II 2 is the spectral 
norm (or maximum eigenvalue) of C. Moreover, the proximal 
operator of the function h is the ii soft-thresholding given by 
the elementwise operations (here o is the Hadamard product) 

prox;^^(f/) = X A- sgn(f/ — X) o max(|f/ — X| — A, 0). (4) 


The FISTA algorithm [3] can now be stated as Algorithm 1, 
where A is the step size (we use A = Jy), e the stopping 


Algorithm 1 FISTA for FRPCAG 


INPUT: Yi = A, Uo = X, U = 1, e > 0 

for ji = 1,... J do 

Uj = prox;,,ft(U - XjVg{Yj)) 

_ i+UI+4^ 

U+1 — o 


ifr.>i-is-iir<er,iii^ 

BREAK 

end if 
end for 

OUTPUT: Uj^i 


tolerance and J the maximum number of iterations. 


III. Graphs Construction 


We use two types of graphs Gi and G 2 in our proposed 
model. The graph Gi is constructed between the data samples 
or the columns of the data matrix and the graph G 2 is 
constructed between the features or the rows of the data matrix. 
The graphs are undirected and built using a standard and a 
fast K-nearest neighbor strategy. The first step consists of 
searching the closest neighbours for all the samples using 
Euclidean distances. We connect each Xi to its K nearest 
neighbors Xj, resulting in |£| number of connections. The K- 
nearest neighbors are non-symmetric. The second step consists 
of computing the graph weight matrix A as 



Xt^Xj)\\ 2 ^ connected to Xi 

otherwise. 


The parameter a can be set empirically as the average distance 
of the connected samples. Provided that this parameter is not 
big, it does not effect the final quality of our algorithm. Finally, 
in the third step, the normalized graph Laplacian C = I — 
is calculated, where D is the diagonal degree 
matrix. This procedure has a complexity of 0{ne) and each 
Aij can be computed in parallel. Our choice of normalized 
Laplacian is arbitrary and depends on the application under 
consideration. An advantage of using a normalized laplacian as 
compared to an unnormalized is that all the eigenvalues for the 
normalized laplacian lie between 0 and 2 for all the datasets. 
This eases the comparison of the spectra of the laplacians. The 
eigenvalues of the unnormalized laplacian can be unbounded 
and have different ranges for different datasets. Depending on 
the values of n and p the above computation can be done in 
two different ways. 

Strategy 1: For small n, p we can use the above strategy 
directly for both Gi and G 2 even if the dataset is corrupted. 
Although, the computation of A is O(n^), it should be noted 
that with sufficiently small n and p, the graphs Gi and G 2 
can still be computed in the order of a few seconds. 

Strategy 2: For big or high dimensional datasets, i.e, large 
n or large p or both, we can use a similar strategy but the 



































computations can be made efficient (O(nlogn)) using the 
FLANN library (Fast Library for Approximate Nearest Neigh¬ 
bors searches in high dimensional spaces) [24]. However, the 
quality of the graphs constructed using this strategy is slightly 
lower as compared to strategy 1 due to the approximate 
nearest neighbor search method. We describe the complexity 
of FLANN in detail in Section VI. 

Thus for our work the overall quality of graphs can be 
divided into 3 types. 

. Type A: Good sample graph Gi and good feature graph 
G 2 , both constructed using strategy 1. This case corre¬ 
sponds to small n and p. 

• Type B: Good sample graph Gi using strategy 1 and 
noisy feature graph G 2 using strategy 2. This case cor¬ 
responds to small n but large p. 

• Type C: Noisy sample graph Gi and noisy feature graph 
G 2 both constructed using strategy 2 for large n and p. 

We report the performance of FRPCAG for these three com¬ 
binations of graph types, thus the acronyms FRPCAG(A), 
FRPCAG(B) and FRPCAG(C). Although the graph quality is 
lower if FLANN is used for corrupted data, our experiments 
for MNIST dataset show that our proposed model attains better 
results than other state-of-the-art models even with low quality 
graphs. 

IV. Our Motivation: Low-rank matrix on graphs 

In this section we lay down the foundation and motivation 
of our method and take a step towards a theoretical analysis 
of FRPCAG. We build the motivation behind FRPCAG with 
a simple convincing demonstration. We start by answering 
the question: Why do we need two graphs? This discussion 
ultimately leads to the introduction of a new concept, the low- 
rank matrix on a graph. The latter models clusterable data and 
facilitates our theoretical analysis. 

A. The graph of features provides a basis for data 

Consider a simple example of the digit 3 from the traditional 
USPS dataset. We vectorize all the images and form a data 
matrix A, whose columns consist of different samples of digit 
3 from the USPS dataset. In order to motivate the need of the 
graph of features we build the 10 nearest neighbors graph (of 
features), i.e, a graph between the rows of X using the FLANN 
strategy of Section III. Fig. 2 shows the eigenvectors of the 
Laplacian denoted by P. We observe that they have a 3-like 
shape. In Fig. 2, we also plot the eigenvectors associated to the 
experimental covariance matrix G^. We observe that both sets 
of eigenvectors are similar. This is confirmed by computing 
the following matrix: 

r = p~^ CP (5) 

In order to measure the level of alignment between the 
orthogonal basis P and the one behind G, we use the following 

^The experimental covariance matrix is computed as C = , where n 

is the number of samples and X = X—fix for jix = Ef=i Yfj=i ^ij • 
This definition is motivated in [31]. 


ratio: 


^r(r) 





l|ciiag(r)||2 

lirilF 


( 6 ) 


When the two bases are aligned, the covariance matrix G 
and the graph Laplacian L are simultaneously diagonalizable, 
giving a ratio equal to 1. On the contrary, when the bases are 
not aligned, the ratio is close to ^, where p is the dimension 
of the dataset. Note that an alternative would be to compute 
directly the inner product between P and the eigenvectors of 
G. However, using F we implicitly weight the eigenvectors of 
G according to their importance given by their corresponding 
eigenvalues. 

In the special case of the digit 3, we obtain a ratio 
^^(rs) =0.97, meaning that the main covariance eigenvectors 
are well aligned to the graph eigenvectors. Fig. 2 shows a few 
eigenvectors of both sets and the matrix F. This effect has 



Fig. 2. Studying the number 3 of USPS. Left: Covariance eigenvectors 
associated with the 16 highest eigenvalues. Right: Laplacian eigenvectors 
associated to the 16 smallest non-zero eigenvalues. Because of stationarity, 
Laplacian eigenvectors are similar to the covariance eigenvectors. Bottom: 
Ls = P^CsP in dB. Note the diagonal shape of the matrix implying that 
P is aligned with the eigenvectors of C. 

been studied in [31] where the definition of stationary signals 
on graphs is proposed. A similar idea is also the motivation of 
the Laplacianfaces algorithm [13]. A closer look at the bottom 
figure of Fig. 2 shows that most of the energy in the diagonal 
is concentrated in the first few entries. This shows that the first 
few eigenvectors of the Laplacian are more aligned with the 
eigenvectors of the covariance matrix. This phenomena implies 
that the digit 3 of the USPS dataset is low-rank, i.e, only the 
first few eigenvectors (corresponding to the low eigenvalues) 
are enough to serve as the features for this dataset. 

Of course, FRPCAG also acts on the full dataset. Let us 
analyze how the graph eigenvectors evolve when all digits are 
taken into account. Fig. 3 shows the Laplacian and covariance 
eigenvectors for the full USPS dataset. Again we observe some 
alignment: ^^(r) = 0.82. 

From this example, we can conclude that every column 
of a low-rank matrix X lies approximately in the span of 
the eigenvectors Pk^ of the features graph, where k 2 denotes 











Covariance eigenvector 





Fig. 3. Studying the full USPS dataset. Left: Covariance eigenvectors 
associated with the 16 highest eigenvalues. Right: Laplacian eigenvectors 
associated to the 16 smallest non-zero eigenvalues. Because of stationarity, 
Laplacian eigenvectors are similar to the covariance eigenvectors. Bottom: 
r = CP in dB. Note the diagonal shape of the matrix implying that P 
is aligned with the eigenvectors of C. 


the eigenvectors corresponding to the smallest k 2 eigenvalues. 
This is similar to PCA, where a low-rank matrix is represented 
in the span of the first few principal directions or atoms of the 
basis. Alternately, the Laplacian eigenvectors are meaningful 
features for the USPS dataset. Let the eigenvectors P of C 2 
be divided into two sets {Pk^ ^ ^ 

Note that the columns of P^^ contain the eigenvectors corre¬ 
sponding to the low graph frequencies and P^^ contains those 
corresponding to higher graph frequencies. Then we can write, 
X = X"" P E, where X* is the low-rank part and E models 
the noise or corruptions. Thus, 

X = Pk^ApPk^A and 
X* = 

where A G P^ 2 xn ^ ^ 7 ^(p-/c 2 )xn pjg 2 it is also 

clear that whPWF ||P/e2A||i7 for a specific value of k 2 . 

B. The graph of samples provides embedding for data 

The smallest eigenvectors of the graph of samples provide 
an embedding of the data in the low-dimensional space [4]. 
This has a similar interpretation as the principal components in 
PCA. We argue that every row of a low-rank matrix lies in the 
span of the first few eigenvectors of the graph of samples. This 
is similar to representing every row of the low-rank matrix 
as the span of the principal components. Thus, the graph of 
samples £1 encodes a smooth non-linear map towards the 
principal components of the underlying manifold defined by 
the graph £ 1 . In other words, minimization with respect to 
ir{U £1 U^) forces the principal components of the data to be 
aligned with the eigenvectors Q of the graph £1 which corre¬ 
spond to the smallest eigenvalues Ay. This is the heart of many 
algorithms in clustering [25] and dimensionality reduction [4]. 
In our present application this term has two effects. Firstly, 
when the data has a class structure, the graph of samples 
enforces the low-rank U to benefit from this class structure. 


This results in an enhanced clustering of the low-rank signals. 
Secondly, it will force that the low-rank U of the signals 
is well represented by the first few Laplacian eigenvectors 
associated to low Ay. Let the eigenvectors Q of £1 be divided 
into two sets {Qk^ ^ ,Qki ^ pinx{n-ki)^^^ where ki 

denotes the eigenvectors in Q corresponding to the smallest 
ki eigenvalues. Note that the columns of Q/ci contain the 
eigenvectors corresponding to the low graph frequencies and 
contains those corresponding to higher graph frequencies. 
Then, we can write: 

^ = BQI^ + BQl^ and 
V* = BQl^ 

where B G and B G As argued in the 

previous subsection, \\BQJ^\\f <C \\BQI^\\f. 

C. Low-rank matrix on graphs 

From the above explanation related to the role of the 
two graphs, we can conclude the following facts about the 
representation of any clusterable low-rank matrix X*. 

1) It can be represented as a linear combination of the 
Laplacian eigenvectors of the graph of features, i.e, 
X-=Pk,A. 

2) It can also be represented as a linear combination of 
the Laplacian eigenvectors of the graph of samples, i.e, 
X*=BQl. 

As already pointed out, only the first ki or k 2 eigenvectors 
of the graphs correspond to the low frequency information, 
therefore, the other eigenvectors correspond to noise. We are 
now in a position to define low-rank matrix on graphs. 

Definition 1: A matrix X* is (/ci, /C 2 )-low-rank on the 
graphs £1 and £2 if (X*)^ G span((5/cj for alH = 1,... ,p, 
and (X*)y G span(P/c 2 ) for all j = l,...,n. The set of 
(/ci, /C 2 )-low-rank matrices on the graphs £1 and £2 is denoted 
by Cn{Qki,Pk^). 

We note here that X* G span(P/c 2 ) means that the columns 
of X* are in span(P/e 2 ), i.e, (X*)^ G sp8in{Pf.A^ for all i = 
1,.. ., n, where for any matrix A, {A)i is its column vector. 

D. Theoretical Analysis 

The lower eigenvectors Q/ci and Pk^ of £1 and £2 provide 
features for any X G CR{Qk^^Pkf)' Now we are ready 
to formalize our findings mathematically and prove that any 
solution of (1) yields an approximately low-rank matrix. In 
fact, we prove this for any proper, positive, convex and lower 
semi-continuous loss function (possibly ^p-norms || • || 1 , || • jjl, 
..., II • lip. We re-write (1) with a general loss function 

mm (t){U - X) + 7i tr([/ CiU^)+ 72 C2 U) ( 7 ) 

Before presenting our mathematical analysis we gather a 
few facts which will be used later: 

• We assume that the observed data matrix X satisfies 
X = X* -f P where X* G CRiQk^.Pkf) and E 
models noise/corruptions. Furthermore, for any X* G 





ClZ{Qk^^Pk 2 ) there exists a matrix C such that X* = 

Pk^CQ^. 

• — Qk-\_-^k\QP Qk-\_-^k\Q where 

G is a diagonal matrix of lower eigenvalues 

and Kki ^ 'Ji{n-ki)x{n-ki) ^ diagonal matrix of 

higher graph eigenvalues. All values in A are sorted in 
increasing order, thus 0 = Ao<Ai<'''< A/.^ < • • • < 
An-i. The same holds for £2 as well. 

• For a AT-nearest neighbors graph constructed from a 
/ci-clusterable data (along samples) one can expect 
A/ci/A/ci+i ^ 0 as A/ci - 0 and \k^ < A/^.+i. The same 
holds for the graph of features £2 as well. 

• For the proof of the theorem, we will use the fact that 

for any X G there exist A G 7 ^^ 2 xn ^ ^ 

n^ri-k 2 )xn X = Pk^ApPk^A, and B G 

and B G such that X = BQj^ + BQj^. 

Theorem 2: Let X* G jCP{Q, Pk 2 )^ 7 > 0, and E G 
T^pxn solution f/* G of (7) with 71 = 7 /A/ci+i, 

72 = 7 /^/c 2 +i and X = X* + E satisfies 


0(f/* -X) 

< 0 (^) 




■72 






711^111 




^k2 
+ 1 <^/C 2 + l 


)■ 


( 8 ) 


where Xki^^ki+i denote the ki^ki + 1 eigenvalues of £ 1 , 
<^/c 2 5 ^/c 2 +i denote the /c 2 , /c 2 + 1 eigenvalues of £ 2 . 

Proof: As f/* is a solution of (7), we have 


cp{u* - X) + 71 tr(C/* £i(C/*)^) + 72 tr((C/*)T £2 ? 7 *) 

< (/.(£) + 71 tr(X*Li(X*)T) + 72 tr((X*)T C 2 X*). (9) 


Using the facts that £1 = Qk^Ak^Qj^ +QklAk^Qkl and that 
there exists B G and B G such that U* = 

BQj^ + BQj^, we obtain 

tv{U* £i(f/*)^) = tr{BAk,B^) + tr{BAk,B^) 

> tv{Ak,B^B) > Xk,+i\\B\\% = Xk,+i\\U*QkA%. 

Then, using the fact that there exists C G such that 

X* = Pk^CQj^, we obtain 

tr(X*£i(X*)^) = tr(CAfc,C^) < AfeJ|C||^ = AfeJ|X*||2p. 


Similarly, we have 

tr((C/*7£2C/*)>a;fe,+i||7-r^C/*|||, 


tr((X*)T£ 2 X*) <a;fc,||X*|||.. 

Using the four last bounds in (9) yields 
cf>{U* -X)+ ^,Xk,+i\\U*Qk, III + 72UJk,+i\\PlU *III < 

(A(i;)+7iW;bxll^*lll+72C^fe.||X*|||, 
which becomes 

^{U* -X)+ j\\U*Qk, III + 71171IIf 


<<^(£)+7||X 


* 112 


A/ei 


+ 


^k2 


,^/Ci + l <^/C2 + l. 

for our choice of 71 and 72 . This terminates the proof. 


E. Remarks on the theoretical analysis 

( 8 ) implies that 

\\u*QkA\% + II7>1If < P{E) + II^IIf (p^ + -^) ■ 

7 V^/Ci +1 ^/C 2 + l/ 

The smaller t^e closer U* to 

£7^((5/ci,T/C 2 ) is. The above bound shows that to recover a 
low-rank matrix one should have large eigengaps Xk^+i — Xk^ 
and CJ/C 2+1 — ^/C 2 - This occurs when the rows and columns of 
X can be clustered into ki and k 2 clusters. Furthermore, one 
should also try to chose a metric f (or ^^-norm) that minimizes 
0(A). Clearly, the rank of U* is approximately min{/ci, /C 2 }. 

V. Working of FRPCAG 

The previous section presented a theoretical analysis of our 
model. In this section, we explain in detail the working of 
our model for any data. Like any standard low-rank recovery 
method, such as [7], our method is able to perform the 
following two operations for a clusterable data: 

1) Penalization of the singular values of the data. The 
penalization of the higher singular values, which corre¬ 
spond to high frequency components in the data results 
in the data cleaning. 

2) Determination of the clean left and right singular vec¬ 
tors. 

A. FRPCAG is a singular value penalization method 

In order to demonstrate how FRPCAG penalizes the singular 
values of the data we study another way to cater the graph 
regularization in the solution of the optimization problem 
which is contrary to the one presented in Section II-A. In 
Section II-A we used a gradient for the graph regularization 
terms 71 tr(f/ £1 U^) + 72 tr(f/^ £1 U) and used this gra¬ 
dient as an argument of the proximal operator for the soft- 
thresholding. What we did not point out there was that the 
solution of the graph regularizations can also be computed by 
proximal operators. It is due to the reason that using proximal 
operators for graph regularization (that we present here) is 
more computationally expensive. Assume that the prox of 
7 itr(f/ £1 U^) is computed first, and let Z be a temporary 
variable, then it can be written as: 

min||X-Z||| + 7 itr(Z£i 
z 

The above equation has a closed form solution which is given 
as: 

Z = X(/ + 7i £i)-i 

Now, compute the proximal operator for the term 
72tr(C/T £2 U) 

imn||Z-C/|||+ 72 tr([/^ £2 U) 

The closed form solution of the above equation is given as: 

U = {I + '~I2C2)~^Z 








Thus, the low-rank U can be written as: 

C/ = (J + 72£2)“'X(/ + 7i£i)-i 

after this the soft thresholding can be applied on U. 

Let the SVD of X, X = Ci = QKQ^ and 

£2 = , then we get: 

[/ = (/ + 72PAP^)-iI4S^W"7 (/ + 

= P(/ + 72A)-ipTl4I],VF7 Q(/ + 7ifi)-iQ^ 

thus, each singular value cFxi of X is penalized by 1/(1 + 
7 iAi)(l + 72 <^i)- Clearly, the above solution requires the 
computation of two inverses which can be computationally 
intractable for big datasets. 

B. Estimation of clean singular vectors 

The two graph regularization terms tr(f/ Ci U^) 
ir(U^ C 2 U^) encode a weighted penalization in the 
Laplacian basis. Again using £1 = QEQ^ and £2 = 
and U = be the SVD of P, we get 

71 tr(P£i P^)+ 72 tr(P^ £2 U) 

=71 tr(VSlL^QAg^lLSV^) + 72 ti {W^V^ PQP^V^W 

=7i tr(SlL^QAQ^VLS) + 72 tr(SV^PllP^ VS) 

=71 ii(W^ QKQ^WT.^) + 72 tr(V^PflP^ VS^) 

min{n,p} 

= S +72Wj(7Pj)^), (10) 

where Aj and ccj are the eigenvalues in the matrices A 
and ft respectively. The second step follows from V^V = I 
and the cyclic permutation invariance of the trace. In the 
standard terminology Wi and Vi are the principal components 
and principal directions of of the low-rank matrix U. From the 
above expression, the minimization is carried out with respect 
to the singular values and the singular vectors Vi^Wi. The 
minimization has the following effect: 

1) Minimize by performing an attenuation with the 
graph eigenvalues as explained earlier. 

2) When di is big, the principal components Wi are aligned 
with the graph eigenvectors qj for small values of Xj, 
i.e, the lower graph frequencies of £ 1 . The principal 
directions Vi are also aligned with the graph eigenvectors 
Pj for small values of uoj, i.e, the lower graph frequen¬ 
cies of £ 2 . This alignment makes sense as the higher 
eigenvalues correspond to the higher graph frequencies 
which constitute the noise in data. This is explained 
experimentally in the next section. 

C. Experimental Justification of working of ERPC AG 

Now we present an experimental justification for the work¬ 
ing of this model, as described in the previous subsection. In 
summary we illustrate that: 

1) The model recovers a close-to-low-rank representation. 

2) The principal components and principal directions of U 
align with the first few eigenvectors of their respective 


graphs, automatically revealing a low-rank and enhanced 
class structure. 

3) The singular values of the low-rank matrix obtained 
using our model closely approximate those obtained by 
nuclear norm based models even in the presence of 
corruptions. 

Our justification relies mostly on the quality of the singular 
values of the low-rank representation and the alignment of the 
singular vectors with their respective graphs. 

We perform an experiment with 1000 samples of the MNIST 
dataset belonging to two different classes (digits 0 and 1 ). 
We vectorize all the digits and form a data matrix X whose 
columns contain the digits. Then we compute a graph of 
samples between the columns of X and a graph of features 
between the rows of X as mentioned in Section III. We 
determine the clean low-rank U by solving model (1) and 
perform one SVD at the end U = . Finally, we do 

the clustering by performing k-means (k = 2 ) on the low- 
rank U. As argued in [21], if the data is arranged according 
to the classes, the matrix WW^ (where W are the principal 
components of the data) reveals the subspace structure. The 
matrix IVIV^ is also known as the shape interaction matrix 
tSIM) [15]. If the subspaces are orthogonal then SIM should 
acquire a block diagonal structure. Furthermore, as explained 
in Section II our model (1) tends to align the first few principal 
components Wi and principal directions Vi of U to the first 
few eigenvectors qj and pj of £1 and £2 respectively. Thus, 
it is interesting to observe the matrices LUV^Q and HV^P 
scaled with the singular values S of the low-rank matrix U, 
as justified by eq. (10). This scaling takes into account the 
importance of the eigenvectors that are associated to bigger 
singular values. 

Fig. 4 plots the matrix IVIV^, the corresponding clustering 
error, the matrices S, 'EW^Q, and EV^P for different values 
of 7 i and 72 from left to right. Increasing 71 and 72 from 1 
to 30 leads to 1) the penalization of the singular values in S 
resulting in a lower rank 2 ) alignment of the first few principal 
components Wi and principal directions Vi in the direction of 
the first few eigenvectors qj and pj of £1 and £2 respectively 

3) an enhanced subspace structure in WW^ and 4) a lower 
clustering error. Together the two graphs help in acquiring a 
low-rank structure that is suitable for clustering applications 
as well. 

Next we demonstrate that for data with or without corrup¬ 
tions, FRPCAG is able to acquire singular values as good 
as the nuclear norm based models, RPCA and RPCAG. We 
perform three clustering experiments on 30 classes of ORL 
dataset with no block occlusions, 15% block occlusions and 
25% block occlusions. Fig. 5 presents a comparison of the 
singular values of the original data with the singular values 
of the low-rank matrix obtained by solving RPCA, RPCAG 
and our model. The parameters for all the models are selected 
corresponding to the lowest clustering error for each model. It 
is straightforward to conclude that the singular values of the 
low-rank representation using our fast method closely approx¬ 
imate those of the nuclear norm based models irrespective of 
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Fig. 4. The matrices WW^, S, YlW^Q, P and the corresponding clustering errors obtained for different values of the weights on the two graph 
regularization terms for 1000 samples of MNIST dataset (digits 0 and 1). If U = is the SVD of U, then W corresponds to the matrix of principal 

components (right singular vectors of U) and V to the principal directions (left singular vectors of U). Let jCi = QAQ^ and £2 = PQP^ be the eigenvalue 
decompositions of Ci and C 2 respectively then Q and P correspond to the eigenvectors of Laplacians Ci and £ 2 . The block diagonal structure of WW^ 
becomes more clear by increasing 71 and 72 with a thresholding of the singular values in E. Further, the sparse structures of and EV^P towards 

the rightmost corners show that the number of left and right singular vectors which align with the eigenvectors of the Laplacians Ci and C 2 go on decreasing 
with increasing 71 and 72. This shows that the two graphs help in attaining a low-rank structure with a low clustering error. 
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Fig. 5. A comparison of singular values of the low-rank matrix obtained via our model, RPCA and RPC AG. The experiments were performed on the ORL 
dataset with different levels of block occlusions. The parameters corresponding to the minimum validation clustering error for each of the model were used. 


the level of corruptions. 

VI. Computational Complexity 

A. Complexity of Graph Construction 

For n p-dimensional vectors, the computational complexity 
of the FLANN algorithm is 0{pnK{\og{n)/ \og{K))) for the 
graph Gi between the samples and 0{pnK{\og{p)/ \og{K))) 
for the graph G 2 between the features, where K is the number 
of nearest neighbors. This has been shown in [32] & [24] . For 
a fixed K the complexity of Gi is 0{pn\og{n)) and that of 
G 2 is 0{np\og{p)). We use iT = 10 for all the experiments 
reported in this work. The effect of the variation of K on the 
performance of our model is studied briefly in Section VII. 

B. Algorithm Complexity 

1) FISTA: Let / denote the number of iterations for the 
algorithm to converge, p is the data dimension, n is the number 
of samples and c is the rank of the low-dimensional space. The 
computational cost of our algorithm per iteration is linear in 
the number of data samples n, i.e. 0{lpn) for I iterations. 

2) Final SVD: Our model, in order to preserve convexity, 
finds an approximately low-rank solution U without explicitly 
factorizing it. While this gives a great advantage, depending 
on the application we have in hand, we might need to provide 
explicitly the low dimensional representation in a factorized 
form. This can be done by computing an “economic” SVD of 
U after our algorithm has finished. 

Most importantly, this computation can be done in time 
that scales linearly with the number of samples for a fixed 
number of features p n. Let U = the SVD of U. 

The orthonormal basis V can be computed by the eigenvalue 
decomposition of the small pxp matrix UU^ = VEV^ that 
also reveals the singular values H = ^/E since UU^ is s.p.s.d. 
and therefore E is non-negative diagonal. Here we choose 
to keep only the c biggest singular values and corresponding 
vectors according to the application in hand (the procedure for 
determining c is explained in Section VII). Given V and S the 
sample projections are computed sls W = 'E~^V^U. 


The complexity of this SVD is 0{np^) -due to the multipli¬ 
cation UU^- and does not change the asymptotic complexity 
of our algorithm. Note that the standard economic SVD 
implementation in numerical analysis software typically does 
not use this simple trick, in order to achieve better numerical 
error. However, in most machine learning applications like the 
ones of interest in this paper, the compromise in terms of 
numerical error is negligible compared to the gains in terms 
of scalability. 

C. Overall Complexity 

The complexity of FISTA is 0{lpn), the graph Gi is 
0{pn\og{n)), G 2 is 0{pn\og{p)) and the final SVD step is 
0{np‘^). Given that p n, the overall complexity of our 
algorithm is 0(pn(log(n) + /+p-|-log(p))). Table II presents 
the computational complexities of all the models considered 
in this work (discussed in Section VII). 

D. Scalability 

The construction of graphs Gi and G 2 is highly scalable. 
For small n and p the strategy 1 of Section III can be used 
for the graphs construction and each of the entries of the 
adjacency matrix A can be computed in parallel once the 
nearest neighbors have been found. For large n an p the 
approximate K-nearest neighbors scheme (FLANN) is used 
for graphs construction which is highly scalable as well. Next, 
our proposed FISTA algorithm for FRPCAG requires two 
important computations at every iteration: 1) computation of 
proximal operator prox;^^(f/) and 2) the gradient Vg{Y). 
The former computation is given by the element-wise soft- 
thresholding (eq. (4)) that can be performed in parallel for 
all the entries of a matrix. The gradient computation, as 
given by eq. (3), involves matrix-matrix multiplications that 
involve sparse matrices Ci and £2 and can be performed very 
efficiently in parallel as well. 

VII. Results 

Experiments were done using two open-source toolboxes: 
the UNLocBoX [30] for the optimization part and the GSPBox 
[29] for the graph creation. The complete demo, code and 






















TABLE II 

Computational complexity of all the models considered in this work. I denotes the number of iterations for the algorithm to 

CONVERGE, p IS THE DIMENSION, n IS THE NUMBER OF SAMPLES AND C IS THE RANK OF THE LOW-DIMENSIONAL SPACE. ALL THE MODELS WHICH USE 
THE GRAPH Gi ARE MARKED BY ’ + ’. THE CONSTRUCTION OF GRAPH G 2 IS INCLUDED ONLY IN OUR MODEL (FRPCAG). NOTE THAT WE HAVE USED 
COMPLEXITY 0{np^) FOR ALL SVD COMPUTATIONS AND 0{ln) FOR APPROXIMATE EIGENVALUE DECOMPOSITION FOR NCUT AS PROPOSED IN [36], 
WHILE THE LATTER COULD BE USED FOR THE DECOMPOSITION NEEDED BY LE EVEN THOUGH NOT SPECIFIED IN [4]. 


Model 

Complexity Gi 

0{np\og{n)) 

Complexity G 2 

C>(nplog(p)) 

Complexity Algorithm 

for p ^ n 

Overall Complexity 

for p ^ n 

FRPCAG 

+ 

+ 

0{np{I + p)) 

C>(np(log(n) + p + / + log(p))) 

NCut [36] 


- 

0{ln) 

C>(n(plog(n) + 1 )) 

LE[ ] 

+ 

- 


G(n(plog(n) -h n^)) 

PCA 

- 

- 

0{p^n) 

G(np(plog(n) +p)) 

GLPCA [16] 

+ 

- 


G(n(plog(n) -h n^)) 

NMF [20] 

- 

- 

0{lnpc) 

0{lnpc) 

GNMF [6] 

-H 

- 

0{Inp6) 

G(np(/c-h log(n))) 

MMF [44] 

+ 

- 

G(((p + c)c^ +pc)/) 

G(((p + c)c^ -h pc)/ -h pn log(n)) 

RPCA [ ] 

- 

- 

0{lnp^) 

0(np{lp -h log(n))) 

RPCAG [34] 

+ 

- 

0{lnp^) 

0{npllp -h log(n))) 


datasets used for this work are available at . We perform 
two types of experiments corresponding to two applications 
of PCA. 

1) Data clustering in the low-dimensional space. 

2) Low-rank recovery: Static background separation from 
videos. 

We present extensive quantitative results for clustering but 
currently our experiments for low-rank recovery are limited 
to qualitative analysis only. This is because our work on 
approximating the low-rank representation using graphs is the 
first of its kind. The experiments on the low-rank background 
extraction from videos suffice as a proof-of-concept for the 
working of this model. 

We perform our clustering experiments on 7 benchmark 
databases: CMU PIE, ORL, YALE, COIL20, MNIST, USPS 
and MEEAT. CMU PIE, ORL and YALE are face databases 
with small pose variations. COIL20 is a dataset of objects 
with significant pose changes so we select the images for 
each object with less than 45 degrees of pose change. USPS 
and MNIST contain images of handwritten digits and MEeat 
consists of features extracted from handwritten numerals. The 
details of all datasets used are provided in Table III. 

TABLE III 

Details of the datasets used for clustering experiments in this 

WORK. 


Dataset 

Samples 

Dimension 

Classes 

CMU PIE 

1200 

32 X 32 

30 

ORL 

400 

56 X 46 

40 

COIL20 

1400 

32 X 32 

20 

YALE 

165 

32 X 32 

11 

MNIST 

50000 

28 X 28 

10 

USPS 

3500 

16 X 16 

10 

MFEAT 

400 

409 

10 


In order to evaluate the robustness of our model to gross 
corruptions we corrupt the datasets with two different types of 


errors 1) block occlusions and 2) random missing pixels. Block 
occlusions of three different sizes, i.e, 15%, 25% and 40% of 
the total size of the image are placed uniformly randomly 
in all the images of the datasets. Similarly, all the images 
of the datasets are also corrupted by removing 10%, 20%, 
30% and 40% pixels uniformly randomly. Separate clustering 
experiments are performed for each of the different types of 
corruptions. 

We compare the clustering performance of our model with 
10 other models including the state-of-art: 1) k-means on 
original data 2) Normalized Cut (NCut) [36] 3) Laplacian 
Eigenmaps (LE) [4] 4) Standard PCA 5) Graph Laplacian PCA 
(GLPCA) [16] 6) Manifold Regularized Matrix Eactorization 
(MME) [44] 7) Non-negative Matrix Eactorization (NME) 
[20] 8) Graph Regularized Non-negative Matrix Eactorization 
(GNME) [6] 9) Robust PCA (RPCA) [7] and 10) Robust PCA 
on Graphs (RPCAG) [34]. Eor ORL, CMU PIE, COIL20, 
YALE and USPS datasets we compare three different versions 
of our model corresponding to the three types of graphs Gi 
and G 2 . 

As mentioned in Section III, PRPCAG(A) corresponds to 
our model using good quality sample and feature graphs, 
PRPCAG(B) to the case using a good quality sample graph 
and approximate feature graph, and PRPCAG(C) to the case 
where approximate graphs were used both between samples 
and features. All other models for these datasets are evaluated 
using a good quality sample graph Gi. Due to the large size of 
the MNIST dataset, we use ELANN (strategy 2) to construct 
both graphs, therefore we get approximate versions in the 
presence of corruptions. Thus the experiments on MNIST 
dataset are kept separate from the rest of the datasets to 
emphasize the difference in the graph construction strategy. 
We also perform a separate set of experiments on the ORL 
dataset and compare the performance of our model with state- 
of-the-art nuclear norm based models, RPC A [7] and RPC AG 
[34], both with a good quality and an approximate graph. 
We perform this set of experiments only on the ORL dataset 






























(due to its small size) as the nuclear norm based models 
are computationally expensive. Finally, the experiments on 
MFeat dataset are only performed with missing values because 
block occlusions in non-image datasets correspond to an 
unrealistic assumption. The computational complexities of all 
these models are presented in Table II. 

Pre-processing: All datasets are transformed to zero-mean 
and unit standard deviation along the features for the RPCA, 
RPCAG and FRPCAG. For MMF the samples are additionally 
normalized to unit-norm. For NMF and GNMF only the unit- 
norm normalization is applied to all the samples of the dataset. 

Evaluation: We use clustering error as a metric to compare 
the clustering performance of various models. NCut, LE, PC A, 
GLPCA, MMF, NMF and GNMF are matrix factorization 
models that explicitly learn the principal components W. The 
clustering error for these models is evaluated by performing 
k-means on the principal components. RPCA, RPCAG and 
FRPCAG learn the low-rank matrix U. The clustering error 
for these models can be evaluated by performing k-means 
on 1) principal components W obtained by the SVD of 
the low-rank matrix U = VTW^ or 2) the low-rank U 
directly. Note that RPCA and RPCAG determine the exact 
low-rank representation U, whereas our model only shrinks 
singular values and therefore only recovers an approximate 
low-rank representation U. Thus, if one desires to use the 
principal components W for clustering, the dimension of 
the subspace (number of columns of W) can be decided 
by selecting the number of singular values greater than a 
particular threshold. However, this procedure requires SVD 
and can be expensive for big datasets. Instead, it is more 
feasible to perform clustering on the low-rank U directly. 
We observed that similar clustering results are obtained by 
using either W or U, however, for brevity these results are 
not reported. Due to the non-deterministic nature of k-means, 
it is run 10 times and the minimum error over all runs is 
reported. 

Parameter selection for various models: Each model has 
several parameters which have to be selected in the validation 
stage of the experiment. To perform a fair validation for each 
of the models we use a range of parameter values as presented 
in Table IV. For a given dataset, each of the models is run for 
each of the parameter tuples in this table and the parameters 
corresponding to minimum clustering error are selected for 
testing purpose. Furthermore, PCA, GLPCA, MMF, NMF and 
GNMF are non-convex models so they are run 10 times for 
each of the parameter tuple. RPCA, RPCAG and FRPCAG 
are convex so they are run only once. 

Parameter selection for Graphs: For all the experiments 
reported in this paper we use the following parameters for 
graphs Gi and G 2 . K-nearest neighbors = 10 and = 1. It 
is important to point out here that different types of data might 
call for slightly different parameters for graphs. However, for a 
given dataset, the use of same graph parameters (same graph 
quality) for all the graph regularized models ensures a fair 
comparison. 


TABLE IV 

Range of parameter values for each of the models considered 

IN THIS WORK. C IS THE RANK OR DIMENSION OF SUBSPACE, A IS THE 
WEIGHT ASSOCIATED WITH THE SPARSE TERM FOR ROBUST PCA 
FRAMEWORK [ ] AND 7 IS THE PARAMETER ASSOCIATED WITH THE 
GRAPH REGULARIZATION TERM. 


Model 

Para¬ 

meters 

Parameter 

Range 

NCut [36] 

c 

c G { 2 ^, 2 ^, • • • , min(n,p)} 

LE [4] 

PCA 

GLPCA [16] 

c,7 

c G {2^, 2^, • • • , min(n,p)} 

7 p using [16] 

P G {0.1,0.2, ••• ,0.9} 

MMF [44] 

c,7 

c e { 2 T 2 ^, ■ • • ,min(n,p)} 

{2-3, 2 - 2 ,... , 210 } 

NMF [20] 

c 

GNMF [ 6 ] 

c,7 

RPCA [7] 

A 

A e { . ^7 . : 

Y max(n,p) 

0.1 : ^ _ } 

Y max(n,p) 

7 e { 2 - 3 , 2 - 2 ,... , 210 } 

RPCAG [34] 


FRPCAG 

II1I2 

71j72 € { 1 , 2 , • • • , 100 } 


A. Clustering 

1) Comparison with Matrix Factorization Models: Fig. 6 
presents the clustering error for various matrix factorization 
and our proposed model. NMF and GNMF are not evaluated 
for the USPS and MFeat datasets as they are not originally 
non-negative. It can be seen that our proposed model FR- 
PCAG(A) with the two good quality graphs performs better 
than all the other models in most of the cases both in the 
presence and absence of data corruptions. Even ERPCAG(B) 
with a good sample graph Gi and a noisy feature graph 
G 2 performs reasonably well. This shows that our model 
is quite robust to the quality of graph G 2 . However, as 
expected ERPCAG(C) performs worse for ORE, CMU PIE, 
COIL20, YALE and USPS datasets as compared to other 
models evaluated with a good sample graph Gi. Einally, 
our model outperforms others in most of the cases for the 
interesting case of MNIST dataset where both graphs Gi and 
G 2 are noisy for all models under consideration. It is worth 
mentioning here that even though the absolute errors are quite 
high for ERPCAG on the MNIST dataset, it performs relatively 
better than the other models. As PCA is mostly used as a 
feature extraction or a pre-processing step for a variety of 
machine learning algorithms, a better absolute classification 
performance can be obtained for these datasets by using 
ERPCAG as a pre-processing step for supervised algorithms 
as compared to other PCA models. 

2) Comparison with Nuclear Norm based Models: Pig. 6 
also presents a comparison of the clustering error of our model 
with nuclear norm based models, i.e, RPCA and RPCAG for 
ORE dataset. This comparison is of specific interest because 
of the convexity of all the algorithms under consideration. As 
these models require an expensive SVD step on the whole 
low-rank matrix at every iteration of the algorithm, these 
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Fig. 6. A comparison of clustering error of our model with various dimensionality reduction models. The image data sets include: 1) ORL 2) CMU PIE 3) 
COIL20 and 4) YALE. The compared models are: 1) k-means 2) Normalized Cut (NCut) 3) Laplacian Eigenmaps (LE) [4] 4) Standard Principal Component 
Analysis (PCA) 5) Graph Laplacian PC A (GLPCA) [16] 6) Non-negative Matrix Factorization [20] 7) Graph Regularized Non-negative Matrix Factorization 
(GNMF) [6] 8) Manifold Regularized Matrix Factorization (MMF) [44] 9) Robust PCA (RPCA) [7] 10) Fast Robust PCA on Graphs (A) 11) Fast Robust PCA 
on Graphs (B) and 12) Fast Robust PCA on Graphs (C). Two types of corruptions are introduced in the data: 1) Block occlusions and 2) Random missing 
values. NCut, LE, GLPCA, MMF and GNMF are evaluated with a good sample graph Gi. FRPCA(A) corresponds to our model evaluated with a good 
sample and a good feature graph, FRPCA(B) to a good sample graph and a noisy feature graph and FRPCA(C) to a noisy sample and feature graph. NMF 
and GNMF require non-negative data so they were not evaluated for the USPS and MFeat datasets because they are negative as well. MFeat is a non-image 
dataset so it is not evaluated with block occlusions. Due to the large size of the MNIST dataset, we use FLANN algorithm (strategy 2) to construct the graphs, 
therefore we get noisy graphs in the presence of corruptions. 










































































































































































































































































































































































Fig. 7. Principal Components of 1000 samples of digits 0 and 1 of the MNIST dataset in 2D space. For this experiment all the digits were corrupted 
randomly with 15% missing pixels. Our proposed model (lower right) attains a good separation between the digits which is comparable and even better than 
other state-of-the-art dimensionality reduction models. 



Background Separation from Videos via PCA 
Original RPCA RPCAG FRPCAG 


Fig. 8. Static background separation from three videos. Each row shows the actual frame (left), recovered static low-rank background using RPCA, RPCAG 
and our proposed model. The first row corresponds to the video of a restaurant food counter, the second row to the shopping mall lobby and the third to an 
airport lobby. In all the three videos the moving people belong to the sparse component. Thus, our model is able to accurately separate the static portion from 
the three frames as good as the RPCAG. Our model converged in less than 2 minutes for each of the three videos, whereas RPCA and RPCAG converged in 
more than 45 minutes. 






































experiments are performed on small ORL dataset. Clearly, our 
proposed model FRPCAG(A) performs better than the nuclear 
norm based models even in the presence of large fraction 
of gross errors. Interestingly, even FRPCAG(B) with a noisy 
graph G 2 performs better than RPC AG with a good graph Gi. 
Furthermore, the performance of FRPCAG(C) with two noisy 
graphs is comparable to RPCAG with noisy graph, but still 
better than RPCA. 

B. Principal Components 

Fig. 7 shows the principal components of 1000 samples of 
MNIST dataset in two dimensional space obtained by various 
dimensionality reduction models. 500 samples of digit 0 and 
1 each are chosen and randomly corrupted by 15% missing 
pixels for this experiment. Clearly, our proposed model attains 
a good separation between the digits 0 and 1 (represented by 
blue and red points respectively) comparable with other state- 
of-the-art dimensionality reduction models. 

C. Effect of the number of nearest neighbors for graphs 

In order to demonstrate the effect of number of nearest 
neighbors K on the clustering performance of our model we 
perform a small experiment on the ORL dataset which has 400 
images corresponding to 40 classes (10 images per class). We 
perform clustering for different values of K = 5,10,25,40. 
The clustering errors are 17.5%, 17%, 23% and 31% respec¬ 
tively. Interestingly the minimum clustering error occurs for 
K = 5,10 which is less or equal to the number of images 
per class. Thus, when the number of nearest neighbors K is 
approximately equal to or less than the number of images 
per class then the images of the same class are more well 
connected and those across the classes have weak connections. 
This results in a lower clustering error. A good way to set K 
is to use some prior information about the average number 
of samples per class or the rank of the dataset. For our 
experiments we use iT = 10 for all the datasets and this value 
works quite well. The value of K also depends on the number 
of data samples. For big datasets, sparser graphs (obtained 
with lower values of K) tend to be more useful. For example, 
our experiments show that for the MNIST dataset (70,000 
samples), K = 10 is again a good value, even though the 
average number of samples per class is 7000. 

D. Static background separation from videos 

In order to demonstrate the effectiveness of our model to 
recover low-rank static background from videos we perform 
experiments on 1000 frames of 3 videos available online. All 
the frames are vectorized and arranged in a matrix X whose 
columns correspond to frames. The graph Gi is constructed 
between the 1000 frames (columns of X) of the video and the 
graph G 2 is constructed between the pixels of the frames (rows 
of X) following the methodology of Section III. Both graphs 
for all the videos are constructed without the prior knowledge 
of the mask of sparse errors (moving people). Fig. 8 shows the 
recovery of low-rank frames for one actual frame of each of the 
videos. The leftmost plot in each row shows the actual frame. 


the other three show the recovered low-rank representations 
using RPCA, RPCAG and our proposed model (FRPCAG). 
The first row corresponds to a frame from the video of a 
restaurant food counter, the second row to the shopping mall 
lobby and the third row to an airport lobby. In each of the 
three plots it can be seen that our proposed model is able 
to separate the static backgrounds very accurately from the 
moving people which do not belong to the static ground truth. 
Our model converged in less than 2 minutes for each of the 
three videos, whereas RPCA and RPCAG converged in more 
than 45 minutes. 

E. Computational Time 

Table V presents the computational time and number of 
iterations for the convergence of FRPCAG, RPCAG and 
RPCA on different sizes and dimensions of the datasets. We 
also present the time needed for the graph construction. The 
computation is done on a single core machine with a 3.3 GHz 
processor without using any distributed or parallel computing 
tricks. An oc in the table indicates that the algorithm did not 
converge in 4 hours. It is notable that our model requires a very 
small number of iterations to converge irrespective of the size 
of the dataset. Furthermore, the model is orders of magnitude 
faster than RPCA and RPCAG. This is clearly observed from 
the experiments on MNIST dataset where our proposed model 
is 100 times faster than RPCAG. Specially for MNIST dataset 
with 25000 samples, RPCAG and RPCA did not converge 
even in 4 hours whereas FRPCAG converged in less than a 
minute. 

To demonstrate the scalability of our model for big datasets, 
we perform an experiment on the US census 1990 dataset 
available at the UCI machine learning repository. This dataset 
consists of approximately 2.5 million samples and 68 fea¬ 
tures. The approximate K-nearest neighbors graph construction 
strategy using the FLANN algorithm took only 540 secs to 
construct Gi between 2.5 million samples and 42.3 secs, to 
construct G 2 between 68 features. We do not compare the 
performance of this model with other state-of-the-art models 
as the ground truth for this dataset is not available. However, 
we run our algorithm in order to see how long it takes to 
recover a low-rank representation for this dataset. It took 65 
minutes and 200 iterations for the algorithm to converge on a 
single core machine with 3.3 GHz of CPU power. 

VIH. Conclusion 

We present Fast Robust PCA on Graphs (FRPCAG), a fast 
dimensionality reduction algorithm for mining clusters from 
high dimensional and large low-rank datasets. The idea lies 
on the novel concept of low-rank matrices on graphs. The 
power of the model lies in its ability to effectively exploit 
the hidden information about the intrinsic dimensionality of 
the smooth low-dimensional manifolds on which reside the 
clusterable signals and features of the data. Therefore, it targets 
an approximate recovery of low-rank signals by exploiting the 
local smoothness assumption of the samples and features of the 
data via graph structures only. In short our method leverages 1) 


TABLE V 

Computation times (in seconds) eor graphs Gi,G2, FRPCAG, RPCAG, RPCA and the number oe iterations to converge eor 

DIEEERENT DATASETS. THE COMPUTATION IS DONE ON A SINGLE CORE MACHINE WITH A 3.3 GHZ PROCESSOR WITHOUT USING ANY DISTRIBUTED OR 
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Dataset 
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13.7 
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10 
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16 
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350 
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68 

- 
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42.3 

3900 
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smoothness of the samples on a sample graph and 2) smooth¬ 
ness of the features on a feature graph. The proposed method 
is convex, scalable and efficient and tends to outperform 
several other state-of-the-art exact low-rank recovery methods 
in clustering tasks that use the expensive nuclear norm. In an 
ordinary clustering task FRPCAG is approximately 100 times 
faster than nuclear norm based methods. The double graph 
structure also plays an important role towards the robustness 
of the model to gross corruptions. Furthermore, the singular 
values of the low-rank matrix obtained via FRPCAG closely 
approximate those obtained via nuclear norm based methods. 
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